Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who will leave the service and the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned. Each '_______' blank needs to be filled with appropriate code to get the correct result, and every blank is accompanied by a comment that briefly describes what needs to be filled in.
# Importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Importing classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
# Importing imputers and model evaluation metrics
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import (
accuracy_score,
f1_score,
precision_score,
recall_score,
)
# Importing resampling techniques
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Importing metrics
from sklearn import metrics
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
original_data = pd.read_csv("BankChurners.csv")
df = original_data.copy()
print(f"Number of rows: {df.shape[0]}, Number of columns: {df.shape[1]}")
Number of rows: 10127, Number of columns: 21
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
df.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | ... | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | ... | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | ... | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | ... | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | ... | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
5 rows × 21 columns
CLIENTNUM consists of unique IDs for clients and is not relevant for modeling, so we drop it.
df.drop(["CLIENTNUM"], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  object
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  object
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   object
 5   Marital_Status            9378 non-null   object
 6   Income_Category           10127 non-null  object
 7   Card_Category             10127 non-null  object
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB
Count the duplicated rows
df[df.duplicated()].count()
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
Count null values
df.isnull().sum()
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level          1519
Marital_Status            749
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
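Raw null counts can be easier to judge as percentages of the total rows. A minimal sketch on a toy frame (the values here are hypothetical, not the bank data):

```python
import pandas as pd

# Toy stand-in for df with some missing categorical values
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", None, "College", None],
        "Marital_Status": ["Married", "Single", None, "Married"],
    }
)

# Fraction of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Applied to the real df, the same one-liner shows Education_Level at roughly 15% missing and Marital_Status at roughly 7%.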
df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.0 | NaN | NaN | NaN | 46.32596 | 8.016814 | 26.0 | 41.0 | 46.0 | 52.0 | 73.0 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.0 | NaN | NaN | NaN | 2.346203 | 1.298908 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.0 | NaN | NaN | NaN | 35.928409 | 7.986416 | 13.0 | 31.0 | 36.0 | 40.0 | 56.0 |
| Total_Relationship_Count | 10127.0 | NaN | NaN | NaN | 3.81258 | 1.554408 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| Months_Inactive_12_mon | 10127.0 | NaN | NaN | NaN | 2.341167 | 1.010622 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Contacts_Count_12_mon | 10127.0 | NaN | NaN | NaN | 2.455317 | 1.106225 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Credit_Limit | 10127.0 | NaN | NaN | NaN | 8631.953698 | 9088.77665 | 1438.3 | 2555.0 | 4549.0 | 11067.5 | 34516.0 |
| Total_Revolving_Bal | 10127.0 | NaN | NaN | NaN | 1162.814061 | 814.987335 | 0.0 | 359.0 | 1276.0 | 1784.0 | 2517.0 |
| Avg_Open_To_Buy | 10127.0 | NaN | NaN | NaN | 7469.139637 | 9090.685324 | 3.0 | 1324.5 | 3474.0 | 9859.0 | 34516.0 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | NaN | NaN | NaN | 4404.086304 | 3397.129254 | 510.0 | 2155.5 | 3899.0 | 4741.0 | 18484.0 |
| Total_Trans_Ct | 10127.0 | NaN | NaN | NaN | 64.858695 | 23.47257 | 10.0 | 45.0 | 67.0 | 81.0 | 139.0 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | NaN | NaN | NaN | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
# Attrition_Flag
df["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
df["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)
# Convert the columns with an 'object' datatype into categorical variables
for feature in df.columns:
if df[feature].dtype == 'object':
df[feature] = pd.Categorical(df[feature])
# Iterate over columns with 'category' data type in the DataFrame
for i in df.describe(include='category').columns:
print("Unique values in", i, "are :")
print(df[i].value_counts())
print("*" * 50)
Unique values in Gender are :
Gender
F    5358
M    4769
Name: count, dtype: int64
**************************************************
Unique values in Education_Level are :
Education_Level
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: count, dtype: int64
**************************************************
Unique values in Marital_Status are :
Marital_Status
Married     4687
Single      3943
Divorced     748
Name: count, dtype: int64
**************************************************
Unique values in Income_Category are :
Income_Category
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: count, dtype: int64
**************************************************
Unique values in Card_Category are :
Card_Category
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: count, dtype: int64
**************************************************
Gender:
Education_Level:
Marital_Status:
Income_Category:
Card_Category:
df["Income_Category"].replace("abc", np.nan, inplace=True)
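The 'abc' label is an invalid category in Income_Category. A slightly more general approach, sketched here on a hypothetical sample, is to null out anything outside the known set of income bands, so that any unexpected label (not just 'abc') is caught:

```python
import pandas as pd

# Hypothetical sample of the Income_Category column
s = pd.Series(["$40K - $60K", "abc", "Less than $40K", "abc"])

# The valid income bands observed in the data dictionary
valid = {"Less than $40K", "$40K - $60K", "$60K - $80K", "$80K - $120K", "$120K +"}

# Labels outside the known set become NaN, to be imputed later
cleaned = s.where(s.isin(valid))
print(cleaned)
```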
Questions:
- How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
- How do the months of inactivity in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
y = p.get_height()  # top of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the axes
plt.show()
# Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title(
"Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title(
"Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor,
ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# histogram-boxplot for all the numerical columns
for column in df.columns:
if df[column].dtypes != 'category':
histogram_boxplot(df, column)
[Histogram-boxplots are displayed for each numeric column: Attrition_Flag, Customer_Age, Dependent_count, Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio]
# labeled barplot for all the categorical columns
for column in df.columns:
if df[column].dtypes == 'category':
labeled_barplot(df, column)
sns.pairplot(df, hue="Attrition_Flag")
for column in ['Gender', 'Marital_Status', 'Education_Level', 'Income_Category', 'Contacts_Count_12_mon', 'Months_Inactive_12_mon', 'Total_Relationship_Count', 'Dependent_count']:
print('-'*50)
print(f'{column} vs Attrition_Flag')
stacked_barplot(df, column, "Attrition_Flag")
--------------------------------------------------
Gender vs Attrition_Flag
Attrition_Flag     0     1    All
Gender
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
--------------------------------------------------
Marital_Status vs Attrition_Flag
Attrition_Flag     0     1   All
Marital_Status
All             7880  1498  9378
Married         3978   709  4687
Single          3275   668  3943
Divorced         627   121   748
--------------------------------------------------
Education_Level vs Attrition_Flag
Attrition_Flag     0     1   All
Education_Level
All             7237  1371  8608
Graduate        2641   487  3128
High School     1707   306  2013
Uneducated      1250   237  1487
College          859   154  1013
Doctorate        356    95   451
Post-Graduate    424    92   516
--------------------------------------------------
Income_Category vs Attrition_Flag
Attrition_Flag     0     1   All
Income_Category
All             7575  1440  9015
Less than $40K  2949   612  3561
$40K - $60K     1519   271  1790
$80K - $120K    1293   242  1535
$60K - $80K     1213   189  1402
$120K +          601   126   727
--------------------------------------------------
Contacts_Count_12_mon vs Attrition_Flag
Attrition_Flag         0     1    All
Contacts_Count_12_mon
All                 8500  1627  10127
3                   2699   681   3380
2                   2824   403   3227
4                   1077   315   1392
1                   1391   108   1499
5                    117    59    176
6                      0    54     54
0                    392     7    399
--------------------------------------------------
Months_Inactive_12_mon vs Attrition_Flag
Attrition_Flag          0     1    All
Months_Inactive_12_mon
All                  8500  1627  10127
3                    3020   826   3846
2                    2777   505   3282
4                     305   130    435
1                    2133   100   2233
5                     146    32    178
6                     105    19    124
0                      14    15     29
--------------------------------------------------
Total_Relationship_Count vs Attrition_Flag
Attrition_Flag            0     1    All
Total_Relationship_Count
All                    8500  1627  10127
3                      1905   400   2305
2                       897   346   1243
1                       677   233    910
5                      1664   227   1891
4                      1687   225   1912
6                      1670   196   1866
--------------------------------------------------
Dependent_count vs Attrition_Flag
Attrition_Flag     0     1    All
Dependent_count
All             8500  1627  10127
3               2250   482   2732
2               2238   417   2655
1               1569   269   1838
4               1314   260   1574
0                769   135    904
5                360    64    424
--------------------------------------------------
for column in ['Total_Revolving_Bal', 'Credit_Limit', 'Customer_Age', 'Total_Trans_Ct', 'Total_Trans_Amt', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio', 'Months_on_book']:
print('-'*50)
print(f'{column} vs Attrition_Flag')
distribution_plot_wrt_target(df, column, "Attrition_Flag")
-------------------------------------------------- Total_Revolving_Bal vs Attrition_Flag
-------------------------------------------------- Credit_Limit vs Attrition_Flag
-------------------------------------------------- Customer_Age vs Attrition_Flag
-------------------------------------------------- Total_Trans_Ct vs Attrition_Flag
-------------------------------------------------- Total_Trans_Amt vs Attrition_Flag
-------------------------------------------------- Total_Ct_Chng_Q4_Q1 vs Attrition_Flag
-------------------------------------------------- Avg_Utilization_Ratio vs Attrition_Flag
-------------------------------------------------- Months_on_book vs Attrition_Flag
histogram_boxplot(df, 'Total_Trans_Amt')
labeled_barplot(df, 'Education_Level')
Most of the customers have a Graduate or High School education level.
labeled_barplot(df, 'Income_Category')
Most of the customers have an income of less than $40K.
How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
distribution_plot_wrt_target(df, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
How do the months of inactivity in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
distribution_plot_wrt_target(df, 'Months_Inactive_12_mon', "Attrition_Flag")
columns = []
for column in df.columns:
if df[column].dtypes != 'category':
columns.append(column)
plt.figure(figsize=(15, 7))
sns.heatmap(df[columns].corr(), annot=True, vmin=-
1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Avg_Open_To_Buy and Credit_Limit: These variables have a near-perfect positive correlation of 1. This is expected, since open-to-buy is essentially the credit limit minus the revolving balance; as Credit_Limit increases, Avg_Open_To_Buy also increases.
Total_Trans_Amt and Total_Trans_Ct: These variables have a strong positive correlation of 0.807192. It indicates that as the total transaction amount increases, the total transaction count also tends to increase.
Months_on_book and Customer_Age: These variables have a strong positive correlation of 0.788912. It indicates that customers with a higher Months_on_book tend to be older.
On the other hand, the following notable negative correlations can be observed:
Total_Ct_Chng_Q4_Q1 and Attrition_Flag: These variables have a moderate negative correlation of -0.290054. It suggests that customers with a higher rate of change in the number of transactions between Q4 and Q1 are less likely to churn (attrite).
Credit_Limit and Avg_Utilization_Ratio: These variables have a moderate negative correlation of -0.482965. It indicates that as the credit limit increases, the average utilization ratio tends to decrease.
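Such pairs can also be surfaced programmatically rather than read off the heatmap, by ranking the upper triangle of the correlation matrix by absolute value. A small sketch on synthetic data (the column names here are hypothetical, not the bank's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
frame = pd.DataFrame(
    {
        "a": a,
        "b": a * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to a
        "c": rng.normal(size=200),                     # independent noise
    }
)

corr = frame.corr()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
top_pair = pairs.index[0]
print(pairs)
```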
data = df.copy()
data.isna().sum()
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level          1519
Marital_Status            749
Income_Category          1112
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
x = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"]
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
x_temp, x_test, y_temp, y_test = train_test_split(
x, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
x_train, x_val, y_train, y_val = train_test_split(
x_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(x_train.shape, x_val.shape, x_test.shape)
(6075, 19) (2026, 19) (2026, 19)
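Because stratify=y was passed to both splits, each subset should preserve the churn rate. A small sketch verifying this on hypothetical labels with a similar imbalance (roughly 16% positives, as in the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels mimicking the ~16% churn rate
y = np.array([0] * 84 + [1] * 16)
X = np.arange(len(y)).reshape(-1, 1)

X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
# The test split's positive rate stays close to the overall rate
print(y.mean(), y_te.mean())
```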
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# fit and transform the imputer on train data
x_train[cols_to_impute] = imp_mode.fit_transform(x_train[cols_to_impute])
# Transform on validation and test data
x_val[cols_to_impute] = imp_mode.transform(x_val[cols_to_impute])
# Transform the test data using the same fitted imputer
x_test[cols_to_impute] = imp_mode.transform(x_test[cols_to_impute])
# Creating dummy variables for categorical variables
x_train = pd.get_dummies(data=x_train, drop_first=True)
x_val = pd.get_dummies(data=x_val, drop_first=True)
x_test = pd.get_dummies(data=x_test, drop_first=True)
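One caveat with applying pd.get_dummies separately to each split: if a rare category (e.g. a Card_Category level) is missing from the validation or test split, the resulting column sets differ. A hedged sketch of guarding against this by reindexing to the training columns:

```python
import pandas as pd

# Hypothetical splits where the rarer levels are absent from "test"
train = pd.DataFrame({"Card_Category": ["Blue", "Gold", "Silver"]})
test = pd.DataFrame({"Card_Category": ["Blue", "Blue"]})

Xtr = pd.get_dummies(train, drop_first=True)
Xte = pd.get_dummies(test, drop_first=True)

# Align the test columns to the training layout; absent dummies become 0
Xte = Xte.reindex(columns=Xtr.columns, fill_value=0)
print(Xte)
```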
for x in [x_train, x_val, x_test]:
print('-'*50)
print(x.isnull().sum())
After imputation and dummy encoding, x_train, x_val, and x_test each contain the same 29 columns (Customer_Age through Card_Category_Silver) with 0 missing values in every column.
The nature of predictions made by the classification model will translate as follows:
- A false negative (an attriting customer predicted to stay) means the bank loses the customer without any chance to intervene.
- A false positive (a staying customer predicted to attrite) only costs an unnecessary retention effort.
Which metric to optimize? Since losing a customer is costlier than a wasted retention offer, we want to minimize false negatives, i.e., maximize Recall.
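To make the trade-off concrete, here is a quick sanity check of recall and precision on hypothetical toy labels (not model output):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = attrited customer, 0 = existing customer
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # two churners missed, one false alarm

rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 2 / 4
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
print(rec, prec)
```

The two missed churners drag recall down to 0.5 even though accuracy is 70%, which is exactly why recall is the metric to watch here.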
Let's define a function to output different metrics (including recall) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
Model_List = []
Model_List.append(("Bagging", BaggingClassifier, {'random_state': 1}))
Model_List.append(
("Random forest", RandomForestClassifier, {'random_state': 1}))
Model_List.append(("GBM", GradientBoostingClassifier, {'random_state': 1}))
Model_List.append(("Adaboost", AdaBoostClassifier, {'random_state': 1}))
Model_List.append(("Xgboost", XGBClassifier, {
'random_state': 1, 'eval_metric': "logloss"}))
Model_List.append(("dtree", DecisionTreeClassifier, {'random_state': 1}))
Models = [] # Empty list to store all the models
# Appending models into the list
for name, model, params in Model_List:
Models.append((name, model(**params)))
for name, model in Models:
print('-'*50)
print(name)
model.fit(x_train, y_train)
print("\nTraining performance:")
train = model_performance_classification_sklearn(
model, x_train, y_train
)
print(train)
print("Validation performance:")
validation = model_performance_classification_sklearn(
model, x_val, y_val
)
print(validation)
| Model | Set | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Bagging | Training | 0.997202 | 0.985656 | 0.996891 | 0.991242 |
| Bagging | Validation | 0.956071 | 0.812883 | 0.904437 | 0.856220 |
| Random forest | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Random forest | Validation | 0.956565 | 0.797546 | 0.921986 | 0.855263 |
| GBM | Training | 0.972840 | 0.875000 | 0.952062 | 0.911906 |
| GBM | Validation | 0.967423 | 0.855828 | 0.936242 | 0.894231 |
| Adaboost | Training | 0.957366 | 0.826844 | 0.899666 | 0.861719 |
| Adaboost | Validation | 0.961994 | 0.852761 | 0.905537 | 0.878357 |
| Xgboost | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Xgboost | Validation | 0.969398 | 0.883436 | 0.923077 | 0.902821 |
| dtree | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| dtree | Validation | 0.938796 | 0.815951 | 0.806061 | 0.810976 |
Bagging:
Random Forest:
GBM (Gradient Boosting Machine):
Adaboost:
Xgboost:
Decision Tree (dtree):
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
x_train_over, y_train_over = sm.fit_resample(x_train, y_train)
for name, model in Models:
print('-'*50)
print(name)
model.fit(x_train_over, y_train_over)
print("\nTraining performance:")
train = model_performance_classification_sklearn(
model, x_train_over, y_train_over
)
print(train)
print("Validation performance:")
validation = model_performance_classification_sklearn(
model, x_val, y_val
)
print(validation)
| Model | Set | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Bagging | Training | 0.998333 | 0.997647 | 0.999018 | 0.998332 |
| Bagging | Validation | 0.946693 | 0.861963 | 0.816860 | 0.838806 |
| Random forest | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Random forest | Validation | 0.953603 | 0.861963 | 0.851515 | 0.856707 |
| GBM | Training | 0.975583 | 0.979212 | 0.972157 | 0.975672 |
| GBM | Validation | 0.957058 | 0.904908 | 0.840456 | 0.871492 |
| Adaboost | Training | 0.960090 | 0.964699 | 0.955888 | 0.960273 |
| Adaboost | Validation | 0.945706 | 0.901840 | 0.790323 | 0.842407 |
| Xgboost | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Xgboost | Validation | 0.971866 | 0.926380 | 0.901493 | 0.913767 |
| dtree | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| dtree | Validation | 0.934353 | 0.865031 | 0.760108 | 0.809182 |
Bagging:
Random Forest:
GBM (Gradient Boosting Machine):
Adaboost:
Xgboost:
Decision Tree (dtree):
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
x_train_un, y_train_un = rus.fit_resample(x_train, y_train)
for name, model in Models:
print('-'*50)
print(name)
model.fit(x_train_un, y_train_un)
print("\nTraining performance:")
train = model_performance_classification_sklearn(
model, x_train_un, y_train_un
)
print(train)
print("Validation performance:")
validation = model_performance_classification_sklearn(
model, x_val, y_val
)
print(validation)
| Model | Set | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Bagging | Training | 0.995389 | 0.990779 | 1.000000 | 0.995368 |
| Bagging | Validation | 0.924975 | 0.929448 | 0.701389 | 0.799472 |
| Random forest | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Random forest | Validation | 0.933860 | 0.938650 | 0.728571 | 0.820375 |
| GBM | Training | 0.974385 | 0.980533 | 0.968623 | 0.974542 |
| GBM | Validation | 0.934847 | 0.957055 | 0.725581 | 0.825397 |
| Adaboost | Training | 0.949795 | 0.952869 | 0.947047 | 0.949949 |
| Adaboost | Validation | 0.928924 | 0.960123 | 0.704955 | 0.812987 |
| Xgboost | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Xgboost | Validation | 0.938796 | 0.957055 | 0.739336 | 0.834225 |
| dtree | Training | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| dtree | Validation | 0.894867 | 0.920245 | 0.616016 | 0.738007 |
Bagging: validation recall rises to ~0.93, but precision falls to ~0.70.
Random Forest: validation recall ~0.94 with precision ~0.73.
GBM (Gradient Boosting Machine): validation recall improves to ~0.96; undersampling clearly favors recall.
Adaboost: among the highest validation recall values (~0.96), but precision is only ~0.70.
Xgboost: validation recall ~0.96 and the best F1 (~0.83) of the undersampled group.
Decision Tree (dtree): the weakest of the group; validation precision collapses to ~0.62.
Overall, undersampling trades precision for recall across all six models.
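With `sampling_strategy=1`, the undersampler keeps only as many majority-class (retained-customer) rows as there are minority-class (attrited) rows. A pure-Python sketch of that idea, using a hypothetical `random_undersample` helper rather than imblearn's actual implementation:

```python
import random

def random_undersample(X, y, seed=1):
    """Drop majority-class rows at random until both classes are equal in size
    (the effect of sampling_strategy=1)."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy data: 6 majority-class (0) and 2 minority-class (1) rows
X = [[i] for i in range(8)]
y = [0, 0, 0, 1, 0, 0, 1, 0]
X_un, y_un = random_undersample(X, y)
print(sum(y_un), len(y_un) - sum(y_un))  # 2 2 -> perfectly balanced
```

The price of this balance is information loss: most majority-class rows are simply thrown away, which is why the undersampled models above gain recall but lose precision.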
Hyperparameter tuning can take a long time to run; to keep the runtime manageable, you can use the following grids wherever required.
# Grid for Gradient Boosting (GBM)
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(75, 150, 25),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "subsample": [0.5, 0.7, 1],
    "max_features": [0.5, 0.7, 1],
}
# Grid for AdaBoost
param_grid = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Grid for Bagging
param_grid = {
    'max_samples': [0.8, 0.9, 1],
    'max_features': [0.7, 0.8, 0.9],
    'n_estimators': [30, 50, 70],
}
# Grid for Random Forest
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    # Candidate values must be a flat list, not a nested array
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# Grid for the Decision Tree
param_grid = {
    'max_depth': np.arange(2, 6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes': [10, 15],
    'min_impurity_decrease': [0.0001, 0.001],
}
# Grid for XGBoost
param_grid = {
    'n_estimators': np.arange(50, 300, 50),
    'scale_pos_weight': [0, 1, 2, 5, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.05],
    'gamma': [0, 1, 3, 5],
    'subsample': [0.7, 0.8, 0.9, 1],
}
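`RandomizedSearchCV` (used below) evaluates only `n_iter` randomly drawn combinations from grids like these, rather than the full Cartesian product, which is what keeps tuning tractable. A rough pure-Python sketch of the sampling idea, using a hypothetical `sample_candidates` helper rather than sklearn's implementation:

```python
import random

param_grid = {
    'n_estimators': [30, 50, 70],
    'max_samples': [0.8, 0.9, 1],
    'max_features': [0.7, 0.8, 0.9],
}

def sample_candidates(grid, n_iter, seed=1):
    """Draw n_iter random parameter combinations, one value per key."""
    rng = random.Random(seed)
    return [
        {key: rng.choice(values) for key, values in grid.items()}
        for _ in range(n_iter)
    ]

candidates = sample_candidates(param_grid, n_iter=5)
# 5 candidate dicts instead of the 27 combinations a full grid search would try
print(len(candidates))  # 5
```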
Params = [] # Empty list to store all the params
# Appending params into the list
Params.append(
{
"name": "Bagging",
"params":
{
'max_samples': [0.8, 0.9, 1],
'max_features': [0.7, 0.8, 0.9],
'n_estimators': [30, 50, 70],
}
}
)
Params.append(
{
"name": "Random forest",
"params":
{
"n_estimators": [200, 250, 300],
"min_samples_leaf": np.arange(1, 4),
            "max_features": list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
}
)
Params.append(
{
"name": "GBM",
"params":
{
"init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(75, 150, 25),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"subsample": [0.5, 0.7, 1],
"max_features": [0.5, 0.7, 1],
}
}
)
Params.append(
{
"name": "Adaboost",
"params":
{
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
]
}
}
)
Params.append(
{
"name": "Xgboost",
"params":
{
'n_estimators': np.arange(50, 300, 50),
'scale_pos_weight': [0, 1, 2, 5, 10],
'learning_rate': [0.01, 0.1, 0.2, 0.05],
'gamma': [0, 1, 3, 5],
'subsample': [0.7, 0.8, 0.9, 1]
}
}
)
Params.append(
{
"name": "dtree",
"params":
{
'max_depth': np.arange(2, 6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes': [10, 15],
'min_impurity_decrease': [0.0001, 0.001]
}
}
)
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
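Recall is chosen as the tuning metric because missing a customer who is about to attrite (a false negative) is costlier to the bank than a false alarm. What `metrics.recall_score` computes, shown by hand with an illustrative `recall` helper:

```python
def recall(y_true, y_pred):
    """Fraction of actual positives the model recovers: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# 4 actual attriters, 3 of them caught -> recall 0.75
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(recall(y_true, y_pred))  # 0.75
```

Note that the false positive at index 4 does not affect recall at all; it would only lower precision.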
Best_Params = []
# Defining a tuning function
def tuning(x, y, description):
    for name, model in Models:
        print('-' * 50)
        print(name, '-', description)
        # Selecting the right grid for this model
        New_Params = [param for param in Params if param['name'] == name][0]['params']
        # Setting up RandomizedSearchCV
        randomized_cv = RandomizedSearchCV(
            estimator=model,
            param_distributions=New_Params,
            n_iter=10,
            n_jobs=-1,
            scoring=scorer,
            cv=5,
            random_state=1,
        )
        # Fitting RandomizedSearchCV on the data
        randomized_cv.fit(x, y)
        # Saving the best parameters and CV score
        Best_Params.append(
            {
                'name': name,
                'description': description,
                'params': randomized_cv.best_params_,
                'score': randomized_cv.best_score_,
            }
        )
# Hyperparameter Tuning for every model
tuning(x_train, y_train, "Original data")
tuning(x_train_over, y_train_over, "Oversampled data")
tuning(x_train_un, y_train_un, "Undersampled data")
# Printing best Hyperparameters
for param in Best_Params:
    print('-' * 50)
    print(param['name'], '-', param['description'])
    print("Best parameters are {} with CV score={}:".format(param['params'], param['score']))
--------------------------------------------------
Bagging - Original data
Best parameters are {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.9} with CV score=0.8268079539508111:
--------------------------------------------------
Random forest - Original data
Best parameters are {'n_estimators': 250, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.7438147566718996:
--------------------------------------------------
GBM - Original data
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.849361590790162:
--------------------------------------------------
Adaboost - Original data
Best parameters are {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.8637205651491365:
--------------------------------------------------
Xgboost - Original data
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3} with CV score=0.9108267922553637:
--------------------------------------------------
dtree - Original data
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.751941391941392:
--------------------------------------------------
Bagging - Oversampled data
Best parameters are {'n_estimators': 50, 'max_samples': 0.9, 'max_features': 0.7} with CV score=0.9805870807596836:
--------------------------------------------------
Random forest - Oversampled data
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9750974619484692:
--------------------------------------------------
GBM - Oversampled data
Best parameters are {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.957253362581539:
--------------------------------------------------
Adaboost - Oversampled data
Best parameters are {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.944115722834767:
--------------------------------------------------
Xgboost - Oversampled data
Best parameters are {'subsample': 1, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 1} with CV score=0.9868623602532279:
--------------------------------------------------
dtree - Oversampled data
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 4} with CV score=0.9048877215262945:
--------------------------------------------------
Bagging - Undersampled data
Best parameters are {'n_estimators': 70, 'max_samples': 1, 'max_features': 0.8} with CV score=1.0:
--------------------------------------------------
Random forest - Undersampled data
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.930319204604919:
--------------------------------------------------
GBM - Undersampled data
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9559340659340659:
--------------------------------------------------
Adaboost - Undersampled data
Best parameters are {'n_estimators': 70, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.9436630036630037:
--------------------------------------------------
Xgboost - Undersampled data
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3} with CV score=0.9764364207221352:
--------------------------------------------------
dtree - Undersampled data
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.8934432234432235:
Results = []
def Tuned_Models(x, y, description):
    print(description, "\n")
    for name, model, params in Model_List:
        # Selecting the tuned hyperparameters for this model and sampling strategy
        New_Params = [
            param for param in Best_Params
            if param['name'] == name and param['description'] == description
        ][0]['params']
        print('-' * 50)
        print(name)
        print('Parameters:', New_Params)
        # Building and fitting the model with the tuned parameters
        Model_Tuned = model(**New_Params)
        Model_Tuned.fit(x, y)
        print("\nTraining performance:")
        # Note: performance is checked on the original (unsampled) training set
        train = model_performance_classification_sklearn(Model_Tuned, x_train, y_train)
        print(train)
        print("Validation performance:")
        validation = model_performance_classification_sklearn(Model_Tuned, x_val, y_val)
        print(validation)
        # Saving the results
        Results.append(
            {
                "name": name,
                "description": description,
                "training": train,
                "validation": validation,
                "model": Model_Tuned,
                "params": New_Params,
            }
        )
Tuned_Models(x_train, y_train, "Original data")
Original data
--------------------------------------------------
Bagging
Parameters: {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.9}
Training performance:
Accuracy Recall Precision F1
0 0.999012 0.995902 0.997947 0.996923
Validation performance:
Accuracy Recall Precision F1
0 0.963475 0.855828 0.911765 0.882911
--------------------------------------------------
Random forest
Parameters: {'n_estimators': 250, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'}
Training performance:
Accuracy Recall Precision F1
0 0.997695 0.985656 1.0 0.992776
Validation performance:
Accuracy Recall Precision F1
0 0.955577 0.782209 0.930657 0.85
--------------------------------------------------
GBM
Parameters: {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)}
Training performance:
Accuracy Recall Precision F1
0 0.989794 0.956967 0.979036 0.967876
Validation performance:
Accuracy Recall Precision F1
0 0.974334 0.895706 0.941935 0.918239
--------------------------------------------------
Adaboost
Parameters: {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)}
Training performance:
Accuracy Recall Precision F1
0 0.995062 0.979508 0.989648 0.984552
Validation performance:
Accuracy Recall Precision F1
0 0.96693 0.868098 0.921824 0.894155
--------------------------------------------------
Xgboost
Parameters: {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3}
Training performance:
Accuracy Recall Precision F1
0 0.996379 1.0 0.977956 0.988855
Validation performance:
Accuracy Recall Precision F1
0 0.971372 0.957055 0.876404 0.914956
--------------------------------------------------
dtree
Parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5}
Training performance:
Accuracy Recall Precision F1
0 0.938765 0.805328 0.811983 0.808642
Validation performance:
Accuracy Recall Precision F1
0 0.930405 0.782209 0.784615 0.78341
Bagging: validation F1 improves to ~0.88 with precision ~0.91.
Random Forest: very precise (~0.93) but validation recall drops to ~0.78.
GBM (Gradient Boosting Machine): the best all-round validation scores here (F1 ~0.92).
Adaboost: well balanced, validation F1 ~0.89.
Xgboost: highest validation recall (~0.96), helped by scale_pos_weight=10.
Decision Tree (dtree): the constrained tree no longer overfits, but validation F1 falls to ~0.78.
# Checking model's performance
Tuned_Models(x_train_over, y_train_over, "Oversampled data")
Oversampled data
--------------------------------------------------
Bagging
Parameters: {'n_estimators': 50, 'max_samples': 0.9, 'max_features': 0.7}
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Validation performance:
Accuracy Recall Precision F1
0 0.959526 0.895706 0.858824 0.876877
--------------------------------------------------
Random forest
Parameters: {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'}
Training performance:
Accuracy Recall Precision F1
0 0.999342 1.0 0.995918 0.997955
Validation performance:
Accuracy Recall Precision F1
0 0.952122 0.877301 0.833819 0.855007
--------------------------------------------------
GBM
Parameters: {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)}
Training performance:
Accuracy Recall Precision F1
0 0.967078 0.936475 0.868821 0.901381
Validation performance:
Accuracy Recall Precision F1
0 0.9615 0.91411 0.856322 0.884273
--------------------------------------------------
Adaboost
Parameters: {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)}
Training performance:
Accuracy Recall Precision F1
0 0.992593 0.984631 0.969728 0.977123
Validation performance:
Accuracy Recall Precision F1
0 0.964462 0.895706 0.884848 0.890244
--------------------------------------------------
Xgboost
Parameters: {'subsample': 1, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 1}
Training performance:
Accuracy Recall Precision F1
0 0.908313 0.996926 0.637197 0.777467
Validation performance:
Accuracy Recall Precision F1
0 0.896841 0.960123 0.614931 0.749701
--------------------------------------------------
dtree
Parameters: {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 4}
Training performance:
Accuracy Recall Precision F1
0 0.91786 0.875 0.693745 0.773901
Validation performance:
Accuracy Recall Precision F1
0 0.914116 0.861963 0.685366 0.763587
Bagging: still fits the oversampled training data perfectly; validation F1 ~0.88.
Random Forest: validation F1 ~0.86 with recall ~0.88.
GBM (Gradient Boosting Machine): balanced validation scores (recall ~0.91, F1 ~0.88).
Adaboost: validation F1 ~0.89, the best of this group.
Xgboost: scale_pos_weight=5 on top of already balanced data pushes recall to ~0.96 but cuts precision to ~0.61.
Decision Tree (dtree): recall ~0.86 but precision only ~0.69.
Tuned_Models(x_train_un, y_train_un, "Undersampled data")
Undersampled data
--------------------------------------------------
Bagging
Parameters: {'n_estimators': 70, 'max_samples': 1, 'max_features': 0.8}
Training performance:
Accuracy Recall Precision F1
0 0.160658 1.0 0.160658 0.27684
Validation performance:
Accuracy Recall Precision F1
0 0.160908 1.0 0.160908 0.277211
--------------------------------------------------
Random forest
Parameters: {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'}
Training performance:
Accuracy Recall Precision F1
0 0.943374 1.0 0.739394 0.850174
Validation performance:
Accuracy Recall Precision F1
0 0.92695 0.929448 0.707944 0.803714
--------------------------------------------------
GBM
Parameters: {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.7, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)}
Training performance:
Accuracy Recall Precision F1
0 0.961646 1.0 0.807279 0.893364
Validation performance:
Accuracy Recall Precision F1
0 0.946693 0.96319 0.765854 0.853261
--------------------------------------------------
Adaboost
Parameters: {'n_estimators': 70, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)}
Training performance:
Accuracy Recall Precision F1
0 0.930041 0.969262 0.705444 0.816573
Validation performance:
Accuracy Recall Precision F1
0 0.929911 0.972393 0.704444 0.81701
--------------------------------------------------
Xgboost
Parameters: {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.2, 'gamma': 3}
Training performance:
Accuracy Recall Precision F1
0 0.929547 1.0 0.695157 0.820168
Validation performance:
Accuracy Recall Precision F1
0 0.909181 0.984663 0.642 0.77724
--------------------------------------------------
dtree
Parameters: {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5}
Training performance:
Accuracy Recall Precision F1
0 0.870453 0.939549 0.557447 0.699733
Validation performance:
Accuracy Recall Precision F1
0 0.867226 0.920245 0.552486 0.690449
Bagging: degenerate — the tuned max_samples=1 is interpreted as a single training row per estimator, so the model predicts every customer as an attriter (recall 1.0, with accuracy and precision equal to the ~16% attrition rate).
Random Forest: validation recall ~0.93, F1 ~0.80.
GBM (Gradient Boosting Machine): the best balance in this group — validation recall ~0.96 with F1 ~0.85.
Adaboost: validation recall ~0.97 but precision ~0.70.
Xgboost: near-perfect validation recall (~0.98) at precision ~0.64.
Decision Tree (dtree): the weakest; validation precision ~0.55.
for name, model, params in Model_List:
    print('-' * 50)
    print(name)
    Model_Results = [result for result in Results if result['name'] == name]
    training = []
    validation = []
    # Grouping training and validation results by model
    for result in Model_Results:
        training.append(result['training'].T)
        validation.append(result['validation'].T)
    # One column per sampling strategy
    columns = ['Original data', 'Oversampled data', 'Undersampled data']
    models_train = pd.concat(training, axis=1)
    models_train.columns = columns
    print('\n', 'Training')
    print(models_train)
    models_validation = pd.concat(validation, axis=1)
    models_validation.columns = columns
    print('\n', 'Validation')
    print(models_validation)
--------------------------------------------------
Bagging
Training
Original data Oversampled data Undersampled data
Accuracy 0.999012 1.0 0.160658
Recall 0.995902 1.0 1.000000
Precision 0.997947 1.0 0.160658
F1 0.996923 1.0 0.276840
Validation
Original data Oversampled data Undersampled data
Accuracy 0.963475 0.959526 0.160908
Recall 0.855828 0.895706 1.000000
Precision 0.911765 0.858824 0.160908
F1 0.882911 0.876877 0.277211
--------------------------------------------------
Random forest
Training
Original data Oversampled data Undersampled data
Accuracy 0.997695 0.999342 0.943374
Recall 0.985656 1.000000 1.000000
Precision 1.000000 0.995918 0.739394
F1 0.992776 0.997955 0.850174
Validation
Original data Oversampled data Undersampled data
Accuracy 0.955577 0.952122 0.926950
Recall 0.782209 0.877301 0.929448
Precision 0.930657 0.833819 0.707944
F1 0.850000 0.855007 0.803714
--------------------------------------------------
GBM
Training
Original data Oversampled data Undersampled data
Accuracy 0.989794 0.967078 0.961646
Recall 0.956967 0.936475 1.000000
Precision 0.979036 0.868821 0.807279
F1 0.967876 0.901381 0.893364
Validation
Original data Oversampled data Undersampled data
Accuracy 0.974334 0.961500 0.946693
Recall 0.895706 0.914110 0.963190
Precision 0.941935 0.856322 0.765854
F1 0.918239 0.884273 0.853261
--------------------------------------------------
Adaboost
Training
Original data Oversampled data Undersampled data
Accuracy 0.995062 0.992593 0.930041
Recall 0.979508 0.984631 0.969262
Precision 0.989648 0.969728 0.705444
F1 0.984552 0.977123 0.816573
Validation
Original data Oversampled data Undersampled data
Accuracy 0.966930 0.964462 0.929911
Recall 0.868098 0.895706 0.972393
Precision 0.921824 0.884848 0.704444
F1 0.894155 0.890244 0.817010
--------------------------------------------------
Xgboost
Training
Original data Oversampled data Undersampled data
Accuracy 0.996379 0.908313 0.929547
Recall 1.000000 0.996926 1.000000
Precision 0.977956 0.637197 0.695157
F1 0.988855 0.777467 0.820168
Validation
Original data Oversampled data Undersampled data
Accuracy 0.971372 0.896841 0.909181
Recall 0.957055 0.960123 0.984663
Precision 0.876404 0.614931 0.642000
F1 0.914956 0.749701 0.777240
--------------------------------------------------
dtree
Training
Original data Oversampled data Undersampled data
Accuracy 0.938765 0.917860 0.870453
Recall 0.805328 0.875000 0.939549
Precision 0.811983 0.693745 0.557447
F1 0.808642 0.773901 0.699733
Validation
Original data Oversampled data Undersampled data
Accuracy 0.930405 0.914116 0.867226
Recall 0.782209 0.861963 0.920245
Precision 0.784615 0.685366 0.552486
F1 0.783410 0.763587 0.690449
Bagging: strong on original and oversampled data; the tuned undersampled variant is unusable.
Random Forest: similar validation F1 (~0.80-0.86) across strategies; recall peaks with undersampling.
GBM: small train-validation gaps under every strategy; undersampling lifts recall to ~0.96.
Adaboost: generalizes best on undersampled data (training and validation recall both ~0.97).
Xgboost: undersampling gives the highest validation recall (~0.98) at a clear precision cost.
Decision Tree (dtree): consistently the weakest model of the six.
The XGBoost, GBM, and AdaBoost models trained on undersampled data generalize well: their training and validation recall scores are close, and all three keep validation recall above 0.95.
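One way to make "generalised performance" concrete is the gap between training and validation recall; a small gap suggests the model is not overfitting. The values below are taken from the tuned undersampled-data tables above, and `recall_gap` is just an illustrative helper:

```python
def recall_gap(train_recall, val_recall):
    """Absolute train-validation recall gap; smaller means better generalization."""
    return abs(train_recall - val_recall)

# (training recall, validation recall) for the tuned models on undersampled data
models = {
    'GBM':      (1.0,      0.963190),
    'Adaboost': (0.969262, 0.972393),
    'Xgboost':  (1.0,      0.984663),
    'dtree':    (0.939549, 0.920245),
}
for name, (train_r, val_r) in models.items():
    print(f"{name}: gap = {recall_gap(train_r, val_r):.4f}")
```

AdaBoost shows the smallest gap here, while all three boosting models stay within a few recall points of their training scores.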
results_selected_models = []

# Evaluating a selected model (tuned on undersampled data) on the test set
def selected_model(name):
    model = [
        model for model in Results
        if model['name'] == name and model['description'] == 'Undersampled data'
    ][0]['model']
    results = model_performance_classification_sklearn(model, x_test, y_test)
    results_selected_models.append(results.T)

# List of the top 3 models
selected_models = ['Xgboost', 'Adaboost', 'GBM']
for model_name in selected_models:
    selected_model(model_name)
# Grouping the results of each model
selected_models_results = pd.concat(
results_selected_models,
axis=1
)
selected_models_results.columns = selected_models
print('Final results\n')
print(selected_models_results)
Final results
Xgboost Adaboost GBM
Accuracy 0.905726 0.926456 0.943731
Recall 0.990769 0.972308 0.972308
Precision 0.631373 0.692982 0.750594
F1 0.771257 0.809219 0.847185
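As a sanity check, each reported F1 score is the harmonic mean of the precision and recall beside it, F1 = 2PR / (P + R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# GBM's test-set precision and recall from the table above
print(round(f1(0.750594, 0.972308), 6))  # 0.847185
```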
Based on the test-set results above, GBM achieves the highest accuracy (0.944), precision (0.751), and F1 score (0.847) among the three shortlisted models, while matching AdaBoost's recall (0.972). In other words, it catches roughly 97% of the customers who actually attrite while raising fewer false alarms than XGBoost or AdaBoost, giving the best balance between recall and precision.
# Selecting the final model (GBM tuned on undersampled data)
model = [
    model for model in Results
    if model['name'] == 'GBM' and model['description'] == 'Undersampled data'
][0]['model']

# Feature names
feature_names = x_train.columns
# Getting the importances
importances = model.feature_importances_
indices = np.argsort(importances)

# Plotting the feature importances
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
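The `np.argsort` call returns the indices that would sort the importances in ascending order, so `barh` (which draws bars bottom-up) puts the most important feature at the top of the chart. A minimal illustration with made-up importances and purely illustrative feature names:

```python
import numpy as np

importances = np.array([0.10, 0.55, 0.05, 0.30])
feature_names = ['Age', 'Total_Trans_Ct', 'Gender', 'Total_Revolving_Bal']

# Indices that sort importances ascending: smallest importance first
indices = np.argsort(importances)
print(list(indices))  # [2, 0, 3, 1]
# barh plots bottom-up, so the last (largest) importance lands on top
print([feature_names[i] for i in indices])
```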
Given the GBM model's consistently strong performance across all sampling strategies and its good balance between precision and recall, it is a suitable final model for predicting customer attrition and improving the bank's services. Further analysis, such as assessing the model's stability and robustness on new data, should still be conducted before making a final decision.